January 28, 2025
\[ \newcommand\hbb{{\hat{\boldsymbol \beta}}} \newcommand\bb{{\boldsymbol \beta}} \newcommand\expn{{\frac{1}{N} \sum \limits_{i = 1}^N}} \newcommand\sumk{\sum \limits_{k = 1}^K} \newcommand\argminb{\underset{\bb}{\text{argmin }}} \newcommand\argmaxb{\underset{\bb}{\text{argmax }}} \newcommand\gtheta{\mathbf g(\boldsymbol \theta)} \newcommand\htheta{\mathbf H(\boldsymbol \theta)} \]
For models with parameters, our goal is to find a solution that minimizes the empirical risk.
We can express most problems with parameters, generally, in the following form:
\[\hbb = \underset{\bb}{\text{argmin }} \expn L(y_i , f(\bb , \mathbf x_i)) + \lambda C(\bb)\]
where \(C(\cdot)\) is a penalty that increases with the complexity of the model.
Given a loss specification, the goal of machine learning methods fit via the ERM framework is to find the parameters that minimize the loss.
Almost all of our commonly used supervised methods fall into this framework. Exceptions:
KNN (there are no parameters)
Tree based approaches (the fitting is done via binary recursive splitting methods)
Maybe some that I’m forgetting? Can y’all think of anything?
For problems with a continuous outcome, the loss function is often mean squared error:
\[\hbb = \argminb \expn (y_i - \phi(\mathbf x_i)^T \bb - \alpha)^2 + \lambda C(\bb)\]
For problems with 2 or more discrete outcomes, the loss function is often cross-entropy:
\[\hbb = \argminb - \expn \sumk I(y_i = k) \log \theta_{ik} + \lambda \sumk C(\bb_k)\]
\[\theta_{ik} = \frac{\exp[\phi(\mathbf x_i)^T \boldsymbol \beta_k]}{\sum \limits_{h = 1}^K \exp[\phi(\mathbf x_i)^T \boldsymbol \beta_h]}\]
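To make the two pieces above concrete, here is a minimal NumPy sketch of the softmax probabilities and the resulting cross-entropy loss. The function names and the simulated data are illustrative, not part of the lecture's code:

```python
import numpy as np

def softmax_probs(X, B):
    """theta_ik = exp(x_i^T beta_k) / sum_h exp(x_i^T beta_h)."""
    Z = X @ B                               # (N, K) matrix of linear scores
    Z = Z - Z.max(axis=1, keepdims=True)    # stabilize the exponentials
    expZ = np.exp(Z)
    return expZ / expZ.sum(axis=1, keepdims=True)

def cross_entropy(X, y, B):
    """Average negative log probability of the observed classes."""
    theta = softmax_probs(X, B)
    N = X.shape[0]
    return -np.log(theta[np.arange(N), y]).mean()

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = rng.integers(0, 4, size=100)
B = np.zeros((3, 4))                        # all-zero coefficients
# With all-zero coefficients every theta_ik = 1/K, so the loss is log(K)
print(np.isclose(cross_entropy(X, y, B), np.log(4)))  # True
```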
Surprisingly, even SVMs fit into this framework!
Suppose that we have two classes labelled -1 and 1. We can define the SVM loss function as:
\[\hbb = \argminb \expn \varphi(y_i(\phi(\mathbf x_i)^T \bb - \alpha)) + \lambda \|\bb\|^2\]
where \(\varphi(a) = \text{max}(0,1 - a)\) is the hinge loss.
We don’t frequently fit SVMs this way because the hinge loss is not differentiable everywhere.
SVMs are most frequently fit using LIBSVM - a very fast constrained optimization library designed to solve the corresponding quadratic programming problem.
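The hinge loss itself is simple to evaluate directly; a minimal NumPy sketch (the function name and data below are illustrative):

```python
import numpy as np

def hinge_loss(X, y, beta, alpha, lam):
    """Average hinge loss plus an L2 penalty; y must be coded as -1/+1."""
    margins = y * (X @ beta - alpha)            # y_i * f(x_i)
    return np.maximum(0.0, 1.0 - margins).mean() + lam * beta @ beta

# If every margin is at least 1, only the penalty term remains
X = np.array([[2.0], [-2.0]])
y = np.array([1.0, -1.0])
print(hinge_loss(X, y, np.array([1.0]), 0.0, lam=0.1))  # 0.1
```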
All of these methods require finding the coefficients that minimize the loss function!
The question then becomes: how can we find the minimum?
A quick side note:
We sometimes encounter maximization problems. We’ll always switch this to a corresponding minimization problem.
Example: Maximum likelihood estimation
Given a log likelihood, \(\log f(\bb | \mathcal D)\), we want to find:
\[\hbb = \argmaxb \log f(\bb | \mathcal D)\]
Without loss of generality:
\[\hbb = \argminb - \log f(\bb | \mathcal D)\]
Notation Alert: For my own sanity, I’m going to omit the feature extractor function \(\phi(\mathbf x_i)\) from the rest of this lecture. We can always transform the features before we pass them to these different learning methods!
Sometimes, we get lucky and can solve for the minimum by hand.
Example: Least Squares w/ L2 regularization
\[\hbb = \argminb \expn (y_i - \mathbf x_i^T \bb - \alpha)^2 + \frac{\lambda}{N} \sum \limits_{j = 1}^P \beta_j^2\]
\[\hbb = (\mathbf X^T \mathbf X + \lambda \mathcal I_P)^{-1} \mathbf X^T \mathbf y\]
Note that we’re dividing by \(N\) to keep everything as an average over observations.
This will pay off later.
We know that this corresponds to a point in the parameter space where the gradient for the coefficients is equal to zero.
How do we know that this is a minimum?
How do we know that this is a global minimum?
Notation:
Let \(\mathcal L(\boldsymbol \theta)\) be our loss function where \(\boldsymbol \theta\) is the set of parameters we would like to learn from the data.
Let \(\nabla \mathcal L(\boldsymbol \theta) = \mathbf g(\boldsymbol \theta)\) be the gradient vector for the loss function w.r.t. the unknowns:
\[\gtheta = \begin{bmatrix} \frac{\partial \mathcal L(\boldsymbol \theta)}{\partial \theta_1} \\ \frac{\partial \mathcal L(\boldsymbol \theta)}{\partial \theta_2} \\ \frac{\partial \mathcal L(\boldsymbol \theta)}{\partial \theta_3} \\ \vdots \end{bmatrix} \]
Jacobians:
A matrix of partial derivatives with the outputs on the rows and the unknowns on the columns.
In other words, each row of the Jacobian is a gradient. For scalar value outputs (like a loss function):
\[ J(\boldsymbol \theta) = \gtheta^T \]
Notation:
Let \(\nabla^2 \mathcal L(\boldsymbol \theta) = \htheta\) be the Hessian matrix:
\[\htheta = \begin{bmatrix} \frac{\partial^2 \mathcal L(\boldsymbol \theta)}{\partial \theta_1\partial \theta_1} & \frac{\partial^2 \mathcal L(\boldsymbol \theta)}{\partial \theta_1\partial \theta_2} & \frac{\partial^2 \mathcal L(\boldsymbol \theta)}{\partial \theta_1\partial \theta_3} & \ldots \\ \frac{\partial^2 \mathcal L(\boldsymbol \theta)}{\partial \theta_2\partial \theta_1} & \frac{\partial^2 \mathcal L(\boldsymbol \theta)}{\partial \theta_2\partial \theta_2} & \frac{\partial^2 \mathcal L(\boldsymbol \theta)}{\partial \theta_2\partial \theta_3} & \ldots \\ \frac{\partial^2 \mathcal L(\boldsymbol \theta)}{\partial \theta_3\partial \theta_1} & \frac{\partial^2 \mathcal L(\boldsymbol \theta)}{\partial \theta_3\partial \theta_2} & \frac{\partial^2 \mathcal L(\boldsymbol \theta)}{\partial \theta_3\partial \theta_3} & \ldots \\ \vdots & \vdots & \vdots & \ddots \end{bmatrix} \]
Can also define the Hessian as:
\[ \frac{\partial J(\boldsymbol \theta)}{\partial \boldsymbol \theta} \]
For a function, a point, \(\boldsymbol \theta^*\) is a local minimum (the lowest point in a neighborhood of the parameter space) if:
\(\mathbf g(\boldsymbol \theta^*) = \mathbf 0\)
\(\mathbf H(\boldsymbol \theta^*)\) is positive definite. For a \(P \times P\) Hessian matrix, we can define positive definiteness as:
\[\mathbf b^T \mathbf H(\boldsymbol \theta^*) \mathbf b > 0 \text{ for all } \mathbf b \in \mathbb R^P, \mathbf b \neq \mathbf 0\]
Over the set of all points, \(\boldsymbol \theta^*\), that are local minima, the global minimum is the one that achieves the lowest value w.r.t the loss function.
It is always possible that a loss function has multiple sets of unknowns that lead to local minima.
However, we’re not going to worry about that too much
Most common ML algorithms are constructed in such a way that they admit a single minimum
Suppose our loss function is twice differentiable over its domain.
The loss function is strictly convex if \(\htheta\) is positive definite over its entire domain!
Strictly convex functions are guaranteed to have a single local minimum
Therefore, it must be the global minimum
Let’s go back to L2 regularized linear regression:
\[\mathcal L(\boldsymbol \beta, \alpha) = \expn (y_i - \mathbf x_i^T \boldsymbol \beta - \alpha)^2 + \frac{\lambda}{N} \sum \limits_{j = 1}^P \beta_j^2\]
and argue that the minimum:
\[\hbb = (\mathbf X^T \mathbf X + \lambda \mathcal I_P)^{-1} \mathbf X^T \mathbf y\]
is a global minimum.
Let’s start by finding the gradient for the unknown parameters, \(\bb\) (absorbing \(\alpha\) into \(\bb\) via a column of 1s):
\[\mathbf g(\bb) = - \expn 2(y_i - \mathbf x_i^T \bb)\mathbf x_i + \frac{2 \lambda}{N} \bb\]
which can be more compactly written as:
\[\mathbf g(\bb) = -\frac{2}{N} \mathbf X^T (\mathbf y - \mathbf X \bb) + \frac{2 \lambda}{N} \bb\]
Now, find the Hessian:
\[\frac{\partial^2 \mathcal L(\boldsymbol \beta, \alpha)}{\partial \bb \partial \bb^T} = \frac{2}{N} \sum \limits_{i = 1}^N \mathbf x_i \mathbf x_i^T + \frac{2}{N} \lambda \mathcal I_P\]
which can be more compactly written as:
\[\mathbf H(\bb) = \frac{2}{N}\left[ \mathbf X^T \mathbf X + \lambda \mathcal I_P \right]\]
To show that this matrix is positive definite, we only need to look at the definition for positive definiteness:
\[\frac{2}{N} \mathbf b^T \left[ \mathbf X^T \mathbf X + \lambda \mathcal I_P \right] \mathbf b > 0 \text{ for all } \mathbf b \in \mathbb R^P\]
Multiplying everything through:
\[(\mathbf X \mathbf b)^T (\mathbf X \mathbf b) + \lambda \mathbf b^T \mathbf b\]
which is necessarily positive for any \(\mathbf b \neq \mathbf 0\) whenever \(\lambda > 0\).
Let’s demonstrate this using a sample data set.
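A minimal NumPy sketch of such a demonstration on simulated data (the coefficients, seed, and sizes below are made up for illustration): we compute the closed-form ridge solution, confirm the gradient vanishes there, and confirm the Hessian is positive definite.

```python
import numpy as np

rng = np.random.default_rng(42)
N, P, lam = 200, 3, 0.5
X = rng.normal(size=(N, P))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=N)

# Closed-form ridge solution (intercept omitted for simplicity)
beta_hat = np.linalg.solve(X.T @ X + lam * np.eye(P), X.T @ y)

# The gradient at the solution should be (numerically) zero
g = -2 / N * X.T @ (y - X @ beta_hat) + 2 * lam / N * beta_hat
print(np.allclose(g, 0))                    # True

# The Hessian should be positive definite: all eigenvalues > 0
H = 2 / N * (X.T @ X + lam * np.eye(P))
print(np.all(np.linalg.eigvalsh(H) > 0))    # True
```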
Neat!
We have a strictly convex loss function and a unique global minimizer.
Now, let’s move on to the more novel case.
For now, let’s concentrate on binary logistic regression with no penalty.
\[\hbb = \argminb -\expn y_i \log \theta_{i} + (1 - y_i) \log (1 - \theta_{i})\]
\[\theta_{i} = \frac{\exp[z_i]}{1 + \exp[z_i]}\]
\[ z_i = \mathbf x_i^T \boldsymbol \beta \]
Goal: Find \(\boldsymbol \beta\) that minimizes the negative log likelihood.
Let’s work through this one together on the board.
The derivation of the gradient for logistic regression isn’t too difficult
The gradient:
\[\mathbf g(\bb) = \expn \left[\sigma(\mathbf x_i^T \boldsymbol \beta) - y_i\right]\mathbf x_i\]
Unfortunately, this gradient does not admit an analytical solution for the values at the critical point!
We’ll need to use computational minimization methods
Before we do this, though, it is useful to check if there exists a unique solution for the coefficients.
With no penalty, we can simplify this problem to just checking that the Hessian w.r.t. \(\bb\) is always positive definite (since we can always just add a column of 1s).
Take a few minutes to try to derive the Hessian for the logistic regression coefficients. Also try to put it in matrix form.
The Jacobian (rewritten in a form that is a little easier to work with):
\[\mathbf J(\bb) = \mathbf g(\bb)^T = \expn \left[\sigma(\mathbf x_i^T \boldsymbol \beta) \mathbf x_i^T - y_i \mathbf x_i^T \right]\]
Some helpful hints:
\(\frac{d}{d\boldsymbol \beta} \sigma(\mathbf x_i^T \boldsymbol \beta) = \left[\sigma(\mathbf x_i^T \boldsymbol \beta)(1 - \sigma(\mathbf x_i^T \boldsymbol \beta))\right]\mathbf x_i\)
\(\frac{d}{d \boldsymbol \beta} \mathbf x_i^T \boldsymbol \beta = \mathbf x_i\)
Order matters! \(\frac{d}{dx} f(g(x)) = \frac{d}{du} f(u) \frac{d}{dx} g(x)\)
The final matrix should be \(P \times P\)
This Hessian can be rewritten in matrix form (dropping the \(\frac{1}{N}\) factor, which does not affect definiteness) as:
\[\mathbf X^T \mathbf S \mathbf X\]
where \(\mathbf S\) is a \(N \times N\) matrix with \(\sigma(\mathbf x_i^T \boldsymbol \beta)\left[ 1 - \sigma(\mathbf x_i^T \boldsymbol \beta) \right]\) along the diagonal.
Since \(\mathbf S\) is a diagonal matrix with only positive elements, we can show that:
\[\mathbf b^T \mathbf X^T \mathbf S \mathbf X \mathbf b = (\mathbf S^{1/2} \mathbf X \mathbf b)^T (\mathbf S^{1/2} \mathbf X \mathbf b) = \|\mathbf S^{1/2} \mathbf X \mathbf b\|^2\]
which must be nonnegative, and strictly positive whenever \(\mathbf X \mathbf b \neq \mathbf 0\) (i.e., when \(\mathbf X\) has full column rank).
Given the gradient and Hessian, how can we find \(\hbb\)?
Let’s start by assuming that we have a “guess” for the parameter values, \(\bb_0\).
We would like to update \(\bb_0\) to \(\bb_1\) such that:
\(\mathcal L(\bb_1) < \mathcal L(\bb_0)\)
Let \(\mathbf d\) be a descent direction and \(\eta > 0\) a step size such that:
\[\mathcal L(\bb_0 + \eta \mathbf d) < \mathcal L(\bb_0)\]
\(\mathbf d\) is a vector that defines a direction in \(\mathbb R^P\)
We’ll just assume that \(\mathbf d\) is unit normed - \(\|\mathbf d \| = 1\). The norm of \(\mathbf d\) doesn’t really matter.
\(\eta\) should be sufficiently small. For now, just assume it’s \(\epsilon\) away from zero.
Given the gradient for the loss function evaluated at our current guess, \(\mathbf g(\bb_0)\), we can define the directional derivative in a direction \(\mathbf d\) as:
\[\mathbf d^T \mathbf g(\bb_0)\]
The geometric definition of the inner product says that it can be written as:
\[\|\mathbf d\| \|\mathbf g(\bb_0)\| \cos(\theta)\]
where \(\theta\) is the angle between \(\mathbf d\) and \(\mathbf g(\bb_0)\).
To be most efficient - decrease the loss the most in a single step - we want to move in the direction of steepest descent
\[\mathbf d^* = \underset{\mathbf d}{\text{argmin }} \|\mathbf d\| \|\mathbf g(\bb_0)\| \cos(\theta)\]
We know that \(\|\mathbf d\|\) is 1.
\(\|\mathbf g(\bb_0)\|\) doesn’t depend on \(\mathbf d\)
\(\cos(\theta)\) ranges between -1 and 1.
For what \(\theta\) does \(\cos(\theta) = -1\)?
Therefore, we want to choose \(\mathbf d\) such that it moves exactly away from the gradient vector!
This means that the direction of steepest descent is the negative gradient!
We are guaranteed that if we move \(\epsilon\) away from \(\bb_0\) in the direction of the negative gradient, that we will decrease the loss with the new guess:
\[\bb_1 = \bb_0 - \epsilon \mathbf g(\bb_0)\]
If we do the same thing at \(\bb_1\)
\[\bb_2 = \bb_1 - \epsilon \mathbf g(\bb_1)\]
we are guaranteed to decrease the loss function even more!
Repeating this a lot of times, we’ll eventually get to a point where:
\[\mathbf g(\bb_t) = \mathbf 0\]
and the update no longer changes our guess. We’re stuck!
Fortunately, this is where we want to land! A local minimum.
The gradient descent algorithm:
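In code, the loop can be sketched as follows. This is a minimal illustration, not a production implementation; the stopping rule halts when the loss stops changing between iterations:

```python
import numpy as np

def gradient_descent(loss_fn, grad_fn, beta0, eta=0.1, tol=1e-8, max_iter=100_000):
    """Repeatedly step in the direction of the negative gradient until
    the change in loss between iterations falls below tol."""
    beta = beta0
    prev_loss = loss_fn(beta)
    for _ in range(max_iter):
        beta = beta - eta * grad_fn(beta)
        cur_loss = loss_fn(beta)
        if abs(prev_loss - cur_loss) < tol:
            break
        prev_loss = cur_loss
    return beta

# Toy usage: minimize (b - 3)^2, whose minimum is at b = 3
b_star = gradient_descent(lambda b: (b - 3) ** 2, lambda b: 2 * (b - 3), 0.0)
print(round(b_star, 3))  # 3.0
```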
Guarantee: with enough iterations and sufficiently small step size, we are guaranteed to find the minimum with this algorithm!
However, the number of iterations needed to get there is inversely related to the step size
In practice, we’d be dust in the wind by the time it finished if we set the step size to the machine minimum
Instead, choose a “small” step size and stop the algorithm when the change in loss is sufficiently small
Record the loss at each candidate value of the coefficients
If the difference between the loss at the next step and the current one is less than a small value (I frequently set mine at \(10^{-8}\)), call it close enough!
How small is small?
Let’s look at this by examining a logistic regression fit.
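A self-contained sketch of such a fit on simulated data, using the gradient derived earlier. The true coefficients, sample size, and step size below are made-up illustrative choices:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
N = 1_000
X = np.column_stack([np.ones(N), rng.normal(size=N)])   # column of 1s for the intercept
beta_true = np.array([-0.5, 2.0])
y = rng.binomial(1, sigmoid(X @ beta_true))

beta = np.zeros(2)
eta = 1.0   # this smooth, low-dimensional problem tolerates a large step size
for _ in range(50_000):
    g = X.T @ (sigmoid(X @ beta) - y) / N               # gradient of the NLL
    if np.max(np.abs(g)) < 1e-8:
        break
    beta = beta - eta * g

print(beta)  # close to (-0.5, 2.0), up to sampling noise
```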
Gradient descent works!
Issues arise when the step size is either too small or too large
In this case, it needed a really large step size since the loss function has a pretty “smooth” gradient
Also, it’s low dimensional. So, it doesn’t have that many directions to try.
Other problem: computational cost!
For gradient descent, the total computational complexity depends on the step size and the size of training features.
For the number of iterations to convergence (as a function of step size):
\[\mathcal O(\log 1/\eta)\]
Within each gradient evaluation:
\[\mathcal O(NP)\]
This will not be trivial for modern examples!
Large \(P\)
Large \(N\)
This adds up!
My version of gradient descent for logistic regression takes roughly 1 ms per iteration
Assuming everything scales linearly, a 2D logistic regression problem with 1 million obs would take roughly 5.5 hours to complete in the worst case scenario with 10,000 iterations to convergence!
Is there a time-saving shortcut we can take?
Let’s go back to the initial generalization problem. We want to find a solution that minimizes the generalization error:
\[E_\mathcal T[\mathcal L(y , \hat{y})]\]
We want to approximate this expectation with the empirical risk over the training data:
\[\expn \mathcal L(y_i , \hat{y}_i)\]
Over the training data, we want to evaluate the gradient of the loss at a given value of the unknowns. The derivative of a sum is a sum of the derivatives, so:
\[\frac{1}{N} \mathbf g(\bb; \mathbf X , \mathbf y) = \expn \mathbf g(\bb ; \mathbf x_i , y_i)\]
For problems that attempt to minimize the empirical risk of i.i.d. draws, the gradient is the average of the gradient evaluations at each training point!
A thought: rather than evaluating the gradient at all \(N\) training points, sample a minibatch of size \(B << N\) from the training data, \(\mathcal B\). Approximate the gradient evaluated at \(\bb\) with:
\[\frac{1}{B} \sum \limits_{j \in \mathcal B} \mathbf g(\bb ; \mathbf x_j , y_j)\]
It turns out that this is a really good approximation for
\[\frac{1}{N} \mathbf g(\bb; \mathbf X , \mathbf y)\]
Any thoughts as to why?
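One answer: because the training points are i.i.d., the minibatch average is an unbiased estimator of the full-sample gradient. A quick numeric check on simulated logistic-regression data (all values below are made up for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(3)
N, P, B = 10_000, 5, 32
X = rng.normal(size=(N, P))
beta = rng.normal(size=P)
y = rng.binomial(1, sigmoid(X @ beta))

full_grad = X.T @ (sigmoid(X @ beta) - y) / N

# Average many independent minibatch gradients: an unbiased estimator's
# long-run mean should match the full-sample gradient
draws = []
for _ in range(2_000):
    idx = rng.choice(N, size=B, replace=False)
    draws.append(X[idx].T @ (sigmoid(X[idx] @ beta) - y[idx]) / B)
mc_mean = np.mean(draws, axis=0)
print(np.max(np.abs(mc_mean - full_grad)))  # small: the estimator is unbiased
```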
A sampled variant of gradient descent:
When \(B = 1\), this method is called stochastic gradient descent
When \(1 < B < N\), this method is called minibatch gradient descent
A natural question:
Does this actually work? Can we just get a noisy approximation of the gradient at each step and eventually land where we want to land?
The proof that SGD converges to a minimum is a little more involved than the one for GD.
Some key points:
Due to noise in the training sample, any one update is not guaranteed to lower the loss
However, as long as the samples are truly i.i.d., each average (even over just one observation) is unbiased for the full sample gradient
In expectation we’re moving in the correct direction each time
However, we’re not guaranteed to converge to the point corresponding to the minimum!
The biggest benefit is decreased computation time. For SGD, the computational complexity is \(\mathcal O(P/\eta)\)
We lose the log decrease w.r.t. the step size for a linear decrease
But, we also lose the increase due to \(N\)!
Other benefits:
Won’t waste time recomputing the gradient for training instances that are largely the same - \(\mathbf x_i \approx \mathbf x_j\)
Our initial guess is probably pretty bad. Quickly compute some estimate of the gradient and get out of there in, hopefully, a negative direction.
Let’s look at this by examining a logistic regression fit.
With \(B = 1\), it just stops…
Stopping rule:
If the values change by less than \(\epsilon\), terminate
Not a good idea for SGD. Why?
Alteration 1:
Just run the algorithm for a fixed number of iterations not stopping when the loss “plateaus”
That works! But heavily dependent on step size
When \(\eta\) is too small, we really don’t get there
When \(\eta\) is too big, we have a large noise ball
Why does the noise ball occur?
An aside:
How do we choose the batches? What constitutes an iteration?
Convergence properties hold when the batches are sampled randomly
From an engineering perspective, extremely inefficient for large data sets
Load in random sets on demand
Load in all data and access when needed
Both are prohibitively sloooooowwwwww when \(N\) is large!
Shuffle/pass method:
At the beginning of an epoch, shuffle the data set and split it into mutually exclusive batches of the chosen batch size
Each batch is used, in sequence, to compute a gradient update.
After touching all data, end epoch and reshuffle
Repeat
More efficient because the method can pre-fetch the next batch while the gradient computation is occurring
This is the standard algorithm for PyTorch/Tensorflow
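A sketch of the shuffle/pass loop (illustrative only; real frameworks add prefetching and parallel data loading). The least-squares gradient used in the usage example is just a convenient stand-in:

```python
import numpy as np

def sgd_shuffle_pass(grad_fn, X, y, beta0, eta=0.1, batch_size=32, epochs=5, seed=0):
    """Reshuffle once per epoch, then sweep the shuffled data in
    non-overlapping batches of batch_size."""
    rng = np.random.default_rng(seed)
    beta = beta0.copy()
    N = X.shape[0]
    for _ in range(epochs):
        order = rng.permutation(N)                      # reshuffle at each epoch
        for start in range(0, N, batch_size):
            idx = order[start:start + batch_size]       # next exclusive batch
            beta -= eta * grad_fn(beta, X[idx], y[idx])
    return beta

# Illustrative usage on least squares: gradient of (1/B) sum (y_i - x_i^T b)^2
lsq_grad = lambda b, Xb, yb: -2 * Xb.T @ (yb - Xb @ b) / len(yb)
rng = np.random.default_rng(1)
X = rng.normal(size=(1_000, 2))
y = X @ np.array([1.0, -1.0]) + 0.1 * rng.normal(size=1_000)
beta = sgd_shuffle_pass(lsq_grad, X, y, np.zeros(2), eta=0.05, epochs=20)
print(beta)  # near (1, -1)
```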
Two approaches to make the noise ball less problematic.
Approach #1: Learning Rate Schedules
Rather than using a fixed step size, start with a large step size and temper the learning rate as the algorithm proceeds.
Most common scheduler: Polynomial Decay
\(\eta_t = \eta_0(b t + 1)^{-a}\)
This schedule starts with large steps then quickly decays to smaller steps.
This makes sense:
Our initial guess is probably pretty bad. Move away from it in a hopefully decreasing direction quickly.
Once we move onto a path that is decent, start moving a little slower.
At the end, we’re hopefully around the minimum and we’ll tiptoe towards the actual minimum
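The schedule itself is a one-liner; the hyperparameter values \(a = 0.5\) and \(b = 1\) below are illustrative choices, not recommendations:

```python
def poly_decay(eta0, t, a=0.5, b=1.0):
    """Polynomial decay: eta_t = eta0 * (b*t + 1)^(-a)."""
    return eta0 * (b * t + 1.0) ** (-a)

etas = [poly_decay(1.0, t) for t in range(5)]
print(etas)  # decreasing: 1.0, then roughly 0.707, 0.577, 0.5, 0.447
```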
With polynomial scheduling, we get to the minimum very quickly with just 1 example per iteration!
Fast for logistic regression
Will make computation time much smaller for more complicated methods.
Approach #2: Iterate Averaging
Run through the SGD algorithm as usual computing
\[\bb_{t + 1} = \bb_{t} - \eta \mathbf g(\bb_t)\]
However, denote \(\bar{\bb}\) as the running average of the coefficients at each step
\[\bar{\bb}_{t} = \frac{1}{t} \sum \limits_{s = 1}^{t} \bb_s = \frac{1}{t} \bb_t + \frac{t - 1}{t} \bar{\bb}_{t - 1}\]
Use \(\bar{\bb}_t\) instead of the final \(\bb_t\) as the estimate of the loss minimizer!
Iterate averaging can improve on SGD by forcing the actual estimate to be an average over the giant noise ball.
No learning schedule required
This procedure is guaranteed to achieve asymptotically optimal variance relative to vanilla SGD implementations.
The bias of the estimated gradient decays geometrically
The variance is optimal (i.e., it achieves the smallest variance that can be achieved at any iteration)
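The running-average recursion can be checked numerically. The "iterates" below are simulated noise rather than actual SGD output, since only the update formula is being verified:

```python
import numpy as np

rng = np.random.default_rng(7)
T = 2_000
betas = rng.normal(size=(T, 2))   # stand-in for the SGD iterates b_1, ..., b_T

# Running average: bar_b_t = (1/t) b_t + ((t-1)/t) bar_b_{t-1},
# which maintains the mean of all iterates without storing them
bar = np.zeros(2)
for t, b in enumerate(betas, start=1):
    bar = b / t + (t - 1) / t * bar

print(np.allclose(bar, betas.mean(axis=0)))  # True: same as the direct mean
```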
All in all, SGD is really quick!
Note that this method will work given any gradient for the unknowns as long as the objective can be expressed as an average.
However, vanilla SGD is still pretty inefficient
It wanders around in directions that are not moving directly towards the minimum
Wasted time…
Next class, we’ll talk about modern improvements to the SGD algorithm that will help it move more efficiently
Momentum, Adaptive Gradients, and ADAM.